
RNAseq

RNAseq offers a method to simultaneously measure the genome-wide expression of annotated and novel transcripts.


RNAseq for differential gene expression

A common goal in RNAseq is to identify genes and/or transcripts that are expressed at differing levels between our user-defined sample groups.

Absolute quantification of a gene's expression level is subject to measurement biases (e.g. GC bias in PCR and/or RNA capture/selection), so it is more appropriate to compare the expression of the same gene across conditions than to compare different genes' expression levels within a condition.

This is commonly referred to as differential gene or transcript expression (DGE/DTE).


RNAseq for differential transcript usage

A less frequent but important goal in RNAseq is to identify any changes in the relative abundance of transcripts for a gene between conditions. This can identify changes in a gene's functional role between conditions, i.e. one isoform of a protein may bind and activate a target, while another may simply bind and obstruct other active forms. We call this differential transcript usage (DTU).


The data

In this session we will be reviewing data from Christina Leslie’s lab at MSKCC on T-Reg cells. This can be found on the Encode portal here

FASTQ for TReg RNAseq replicate 1 used in this session can be downloaded here

FASTQ for TReg RNAseq replicate 2 used in practical can be downloaded here

For differential splicing and transcript analysis we will use one of our datasets from a splicing factor Knockdown RNAseq experiment found at GEO here.

What we will cover

In the RNAseq sessions we will identify differential gene expression between our T-cell sample groups, as well as identify potential differential transcript usage. We will further identify biological functions associated with genes, and visualize these in IGV and with summary plots.

We will also identify differentially used transcripts and exons from our splice factor dataset and visualize these in IGV.

In our first session we will look at two separate ways we can gain gene expression estimates from raw sequencing data as FASTQ files.

Working with raw RNAseq data


Working with raw RNAseq data

Once we have the raw FASTQ data we can use the ShortRead package to review our sequence data quality.

We have reviewed how to work with raw sequencing data in the FASTQ in Bioconductor session.

Reading raw RNAseq data

We can subsample reads from a FASTQ file using functions in the ShortRead package.

Here we use the FastqSampler and yield functions to randomly sample a defined number of reads from a FASTQ file. We subsample 1 million reads, which should be enough to assess the quality of the sequencing.

library(ShortRead)
fq <- "https://www.encodeproject.org/files/ENCFF332KDA/@@download/ENCFF332KDA.fastq.gz"
download.file(fq, "ENCFF332KDA.fastq.gz")
fqSample <- FastqSampler("ENCFF332KDA.fastq.gz", n = 10^6)
fastq <- yield(fqSample)
fastq
## class: ShortReadQ
## length: 1000000 reads; width: 50 cycles

Raw RNAseq data QC

Now we have our ShortRead object we can produce some of our more informative plots of FASTQ properties.

First we can extract the read sequences, using the sread() function, and retrieve a matrix of base pair abundance over sequencing cycles (or length of read) using the alphabetByCycle() function.

readSequences <- sread(fastq)
readSequences_AlpbyCycle <- alphabetByCycle(readSequences)
readSequences_AlpbyCycle[1:4, 1:4]
##         cycle
## alphabet   [,1]   [,2]   [,3]   [,4]
##        A 292748 275697 261728 263146
##        C 264301 218260 246023 244318
##        G 207470 253398 239248 237533
##        T 234995 252593 252999 255001

Raw RNAseq data QC

We can plot the abundance of DNA bases (A,C,G,T) over the length of the reads to see any biases. First let's coerce the matrix into an appropriate data.frame.

library(ggplot2)
AFreq <- readSequences_AlpbyCycle["A", ]
CFreq <- readSequences_AlpbyCycle["C", ]
GFreq <- readSequences_AlpbyCycle["G", ]
TFreq <- readSequences_AlpbyCycle["T", ]
toPlot <- data.frame(Count = c(AFreq, CFreq, GFreq, TFreq), Cycle = rep(1:50, 4), 
    Base = rep(c("A", "C", "G", "T"), each = 50))

Raw RNAseq data QC

We can plot the abundance of DNA bases (A,C,G,T) over the length of read to see any biases.

ggplot(toPlot, aes(y = Count, x = Cycle, colour = Base)) + geom_line() + theme_bw()

Raw RNAseq data QC

We can extract information on read quality scores using the quality() function and translate this into a useful score summary per read using the alphabetScore() function.

readQuality <- quality(fastq)
readQualityScores <- alphabetScore(readQuality)
toPlot <- data.frame(ReadQ = readQualityScores)
head(toPlot)
##   ReadQ
## 1  1969
## 2  1913
## 3  1971
## 4  1651
## 5  1971
## 6  1997

Raw RNAseq data QC

We plot the distribution of quality scores to identify whether low quality reads should be filtered.

ggplot(toPlot, aes(x = ReadQ)) + geom_histogram() + theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Raw RNAseq data QC

A final essential check of FASTQ quality is to plot the quality of sequencing over the length of the reads (and so over time). Here we can use the as(MYQUALITIES, "matrix") coercion to translate our ASCII-encoded scores into numeric values and create a boxplot for visualization.

qualAsMatrix <- as(readQuality, "matrix")
boxplot(qualAsMatrix[1:10000, ], outline = F)

Aligning RNAseq data


Aligning RNAseq reads

Following assessment of read quality, we will want to align our reads to the genome taking into account splice junctions.

Not all RNAseq reads will align continuously against our reference genome. Instead they will map across splice junctions, so we need to use the splice-aware aligners we have seen in previous sessions. The resulting BAM file will contain aligned sequence reads for use in further analysis.


Creating a reference genome

First we need to retrieve the sequence information for the genome of interest in FASTA format.

We can use the BSgenome libraries to retrieve the full sequence information.

For the mouse mm10 genome we load the package BSgenome.Mmusculus.UCSC.mm10.

library(BSgenome.Mmusculus.UCSC.mm10)
BSgenome.Mmusculus.UCSC.mm10
## Mouse genome:
## # organism: Mus musculus (Mouse)
## # genome: mm10
## # provider: UCSC
## # release date: Dec. 2011
## # 66 sequences:
## #   chr1                 chr2                 chr3                
## #   chr4                 chr5                 chr6                
## #   chr7                 chr8                 chr9                
## #   chr10                chr11                chr12               
## #   chr13                chr14                chr15               
## #   ...                  ...                  ...                 
## #   chrUn_GL456372       chrUn_GL456378       chrUn_GL456379      
## #   chrUn_GL456381       chrUn_GL456382       chrUn_GL456383      
## #   chrUn_GL456385       chrUn_GL456387       chrUn_GL456389      
## #   chrUn_GL456390       chrUn_GL456392       chrUn_GL456393      
## #   chrUn_GL456394       chrUn_GL456396       chrUn_JH584304      
## # (use 'seqnames()' to see all the sequence names, use the '$' or '[[' operator
## # to access a given sequence)

Creating a reference genome

We will only use the major chromosomes for our analysis so we may exclude random and unplaced contigs. Here we cycle through the major chromosomes and create a DNAStringSet object from the retrieved sequences.

mainChromosomes <- paste0("chr", c(1:19, "X", "Y", "M"))
mainChrSeq <- lapply(mainChromosomes, function(x) BSgenome.Mmusculus.UCSC.mm10[[x]])
names(mainChrSeq) <- mainChromosomes
mainChrSeqSet <- DNAStringSet(mainChrSeq)
mainChrSeqSet
## DNAStringSet object of length 22:
##          width seq                                          names               
##  [1] 195471971 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNN chr1
##  [2] 182113224 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNN chr2
##  [3] 160039680 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNN chr3
##  [4] 156508116 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNN chr4
##  [5] 151834684 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNN chr5
##  ...       ... ...
## [18]  90702639 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNN chr18
## [19]  61431566 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNN chr19
## [20] 171031299 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNN chrX
## [21]  91744698 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNN chrY
## [22]     16299 GTTAATGTAGCTTAATAACAA...TACGCAATAAACATTAACAA chrM

Creating a reference genome

Now we have a DNAStringSet object we can use the writeXStringSet() function to create our FASTA file of sequences to align to.

writeXStringSet(mainChrSeqSet, "BSgenome.Mmusculus.UCSC.mm10.mainChrs.fa")

Creating an Rsubread index

We will be aligning using the subjunc algorithm from the folks behind subread. We can therefore use the Rsubread package. Before we attempt to align our FASTQ files, we will need to first build an index from our reference genome using the buildindex() function.

The buildindex() function simply takes the parameters of our desired index name and the FASTA file to build the index from.

REMEMBER: Building an index is memory intensive and by default is set to 8GB. This may be too large for your laptop or desktop computer.

library(Rsubread)
buildindex("mm10_mainchrs", "BSgenome.Mmusculus.UCSC.mm10.mainChrs.fa", memory = 8000, 
    indexSplit = TRUE)

Rsubread RNAseq alignment

We can align our raw sequence data in FASTQ format to the new FASTA file of our mm10 genome sequence using the Rsubread package. Specifically we will be using the subjunc function as it is splice aware. This means it will detect reads that span introns. This is the major difference between alignment of RNAseq and other genomic technologies like DNAseq and ChIPseq where we will use the align function.

We can also provide a SAF or GTF to our Rsubread call. Although largely unnecessary for gene expression estimation, this will allow us to capture non-canonical splice sites.

We simply need to provide a SAF data.frame or GTF file to the annot.ext parameter, set useAnnotation to TRUE, and set isGTF to FALSE or TRUE depending on whether we supply SAF or GTF as external data.
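For illustration, if we had a GTF file of gene models rather than a SAF data.frame, the call might look like the sketch below (the file name my_annotation.gtf is a placeholder, not a file used in this session):

```r
# Hypothetical sketch: subjunc() with a GTF annotation instead of SAF.
# annot.ext points at the GTF file, and isGTF is switched to TRUE.
myMappedGTF <- subjunc("mm10_mainchrs", "ENCFF332KDA.fastq.gz",
                       output_format = "BAM", output_file = "Treg_1.bam",
                       useAnnotation = TRUE, annot.ext = "my_annotation.gtf",
                       isGTF = TRUE, nthreads = 4)
```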

SAF format using exons()

SAF format is used by Rsubread to hold feature information. We simply need a table of exons’ chromosome locations (chromosome,start,end,strand) and a feature/metafeature ID.

We can retrieve exon locations and their gene IDs using the exons() function. We then keep only exons that are annotated to exactly one gene.

library(TxDb.Mmusculus.UCSC.mm10.knownGene)
myExons <- exons(TxDb.Mmusculus.UCSC.mm10.knownGene, columns = c("tx_id", "gene_id"))
myExons <- myExons[lengths(myExons$gene_id) == 1]
myExons
## GRanges object with 376333 ranges and 2 metadata columns:
##                  seqnames          ranges strand |         tx_id
##                     <Rle>       <IRanges>  <Rle> | <IntegerList>
##        [1]           chr1 4807788-4807982      + |            14
##        [2]           chr1 4807823-4807982      + |            15
##        [3]           chr1 4807830-4807982      + |            16
##        [4]           chr1 4807892-4807982      + |            17
##        [5]           chr1 4807896-4807982      + |            18
##        ...            ...             ...    ... .           ...
##   [376329] chrUn_JH584304     55112-55701      - | 142445,142446
##   [376330] chrUn_JH584304     56986-57151      - | 142445,142446
##   [376331] chrUn_JH584304     58564-58835      - |        142445
##   [376332] chrUn_JH584304     58564-59690      - |        142446
##   [376333] chrUn_JH584304     59592-59667      - |        142445
##                    gene_id
##            <CharacterList>
##        [1]           18777
##        [2]           18777
##        [3]           18777
##        [4]           18777
##        [5]           18777
##        ...             ...
##   [376329]           66776
##   [376330]           66776
##   [376331]           66776
##   [376332]           66776
##   [376333]           66776
##   -------
##   seqinfo: 66 sequences (1 circular) from mm10 genome

SAF format

With the exons GRanges we can create the appropriate data.frame of SAF format for Rsubread.

dfExons <- as.data.frame(myExons)
SAF <- data.frame(GeneID = dfExons$gene_id, Chr = dfExons$seqnames, Start = dfExons$start, 
    End = dfExons$end, Strand = dfExons$strand)

RNA alignment with external annotation

Now we have our SAF formatted exon information we can provide the SAF data.frame to the annot.ext argument, set isGTF as FALSE and set useAnnotation to TRUE.

myMapped <- subjunc("mm10_mainchrs", "ENCFF332KDA.fastq.gz", output_format = "BAM", 
    output_file = "Treg_1.bam", useAnnotation = TRUE, annot.ext = SAF, isGTF = FALSE, 
    nthreads = 4)

Sort and index reads

As before, we sort and index our files using the Rsamtools package's sortBam() and indexBam() functions respectively.

The resulting sorted and indexed BAM file is now ready for use in external programs such as IGV as well as for further downstream analysis in R.

library(Rsamtools)
sortBam("Treg_1.bam", "Sorted_Treg_1")
indexBam("Sorted_Treg_1.bam")


Counting with aligned RNAseq data


Counting in gene models

With our newly aligned reads it is now possible to assign reads to genes in order to quantify a gene's expression level (as reads) within a sample.

First we need to gather our gene models of exons and splice junctions which we can use in our later counting steps.

library(TxDb.Mmusculus.UCSC.mm10.knownGene)
geneExons <- exonsBy(TxDb.Mmusculus.UCSC.mm10.knownGene, by = "gene")
class(geneExons)
## [1] "CompressedGRangesList"
## attr(,"package")
## [1] "GenomicRanges"

Counting in gene models

Now we have our GRangesList with each entry corresponding to a gene, and within each entry a GRanges object of the gene's exons.

geneExons[1:2]
## GRangesList object of length 2:
## $`100009600`
## GRanges object with 9 ranges and 2 metadata columns:
##       seqnames            ranges strand |   exon_id   exon_name
##          <Rle>         <IRanges>  <Rle> | <integer> <character>
##   [1]     chr9 21062393-21062717      - |    233853        <NA>
##   [2]     chr9 21062400-21062717      - |    233855        <NA>
##   [3]     chr9 21062894-21062987      - |    233856        <NA>
##   [4]     chr9 21063314-21063396      - |    233857        <NA>
##   [5]     chr9 21066024-21066377      - |    233858        <NA>
##   [6]     chr9 21066940-21067093      - |    233859        <NA>
##   [7]     chr9 21066940-21067925      - |    233860        <NA>
##   [8]     chr9 21068030-21068117      - |    233867        <NA>
##   [9]     chr9 21073075-21073096      - |    233869        <NA>
##   -------
##   seqinfo: 66 sequences (1 circular) from mm10 genome
## 
## $`100009609`
## GRanges object with 8 ranges and 2 metadata columns:
##       seqnames            ranges strand |   exon_id   exon_name
##          <Rle>         <IRanges>  <Rle> | <integer> <character>
##   [1]     chr7 84935565-84941088      - |    190288        <NA>
##   [2]     chr7 84943141-84943264      - |    190289        <NA>
##   [3]     chr7 84943504-84943722      - |    190290        <NA>
##   [4]     chr7 84943504-84947000      - |    190291        <NA>
##   [5]     chr7 84946200-84947000      - |    190292        <NA>
##   [6]     chr7 84947372-84947651      - |    190293        <NA>
##   [7]     chr7 84948507-84949184      - |    190294        <NA>
##   [8]     chr7 84963816-84964115      - |    190295        <NA>
##   -------
##   seqinfo: 66 sequences (1 circular) from mm10 genome

Counting in gene models

We can now use summarizeOverlaps() to count the reads in our BAM file that overlap genes. We specify a BamFile object using the BamFile() function, setting the yieldSize parameter to 10000 to control the memory footprint.

The result is a RangedSummarizedExperiment object containing our counts and the GRanges of gene models.

library(GenomicAlignments)
myBam <- BamFile("Sorted_Treg_1.bam", yieldSize = 10000)
tregGeneCounts <- summarizeOverlaps(geneExons, myBam, ignore.strand = TRUE)
tregGeneCounts
## class: RangedSummarizedExperiment 
## dim: 24594 1 
## metadata(0):
## assays(1): counts
## rownames(24594): 100009600 100009609 ... 99929 99982
## rowData names(0):
## colnames(1): Sorted_Treg_1.bam
## colData names(0):

Counting in exons models

If we want to evaluate differential transcript usage, we will often count over exons and assess changes in exon usage.


Counting in exons models

To count over exons, however, we will encounter the issue of assigning reads to overlapping exons.


Counting in exons models

We can deal with this by collapsing overlapping exons into nonoverlapping, disjoint regions.
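The idea of disjoint regions can be illustrated with a toy example using the disjoin() function from GenomicRanges (the ranges here are invented purely for illustration):

```r
library(GenomicRanges)
# Two overlapping "exons" on chr1
exns <- GRanges("chr1", IRanges(start = c(1, 50), end = c(100, 150)))
# disjoin() breaks them into nonoverlapping pieces:
# 1-49 (first exon only), 50-100 (shared) and 101-150 (second exon only)
disjoin(exns)
```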


Counting in exons models

We can use the GenomicFeatures package's disjointExons() function with our TxDb object, TxDb.Mmusculus.UCSC.mm10.knownGene, to extract a GRanges of our disjoint exons with their associated gene_id, tx_name and exonic_part in the metadata columns.

We add some sensible names to our GRanges for use in later steps.

library(GenomicFeatures)
nonOverlappingExons <- disjointExons(TxDb.Mmusculus.UCSC.mm10.knownGene)
names(nonOverlappingExons) <- paste(mcols(nonOverlappingExons)$gene_id, mcols(nonOverlappingExons)$exonic_part, 
    sep = "_")
nonOverlappingExons[1:3, ]
## GRanges object with 3 ranges and 3 metadata columns:
##               seqnames            ranges strand |         gene_id
##                  <Rle>         <IRanges>  <Rle> | <CharacterList>
##   100009600_1     chr9 21062393-21062399      - |       100009600
##   100009600_2     chr9 21062400-21062717      - |       100009600
##   100009600_3     chr9 21062894-21062987      - |       100009600
##                                                                      tx_name
##                                                              <CharacterList>
##   100009600_1                      ENSMUST00000115494.2,ENSMUST00000216967.1
##   100009600_2 ENSMUST00000115494.2,ENSMUST00000216967.1,ENSMUST00000213826.1
##   100009600_3                      ENSMUST00000115494.2,ENSMUST00000213826.1
##               exonic_part
##                 <integer>
##   100009600_1           1
##   100009600_2           2
##   100009600_3           3
##   -------
##   seqinfo: 66 sequences (1 circular) from mm10 genome

Counting in exons models

We can now again use the summarizeOverlaps() function to count reads from our BAM file over the nonoverlapping, disjoint exons.

We need to set the inter.feature parameter to FALSE so that reads overlapping multiple features (such as a read spanning between two exons) are still counted.

tregExonCounts <- summarizeOverlaps(nonOverlappingExons, myBam, ignore.strand = TRUE, 
    inter.feature = FALSE)

Counting in exons models

Now we can review our counts from our gene-level and exon-level counting.

We retrieve a matrix of counts from either RangedSummarizedExperiment object using the assay() function.

geneCounts <- assay(tregGeneCounts)
exonCounts <- assay(tregExonCounts)
head(exonCounts, 2)
##             Sorted_Treg_1.bam
## 100009600_1                 0
## 100009600_2                 1
head(geneCounts, 2)
##           Sorted_Treg_1.bam
## 100009600                 1
## 100009609                 0

Transcript quantification with pseudo-alignment


Transcript quantification

More recently, methods have been developed to directly quantify transcript abundance from FASTQ files using k-mer counting.

k-mer counting offers a very fast method of gaining transcript abundance estimates, but it does not produce a genomic alignment, so it is less useful for visualization.

As we are most often interested in gene expression changes, this offers a fast alternative to alignment and counting.

The two leading pieces of software for k-mer counting are:

  • Salmon - Relationship with DESeq2.

  • Kallisto - History of splice-aware aligners and transcript quantification.
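At its simplest, k-mer counting just tallies every k-length substring of a sequence. The toy function below (not part of Salmon or Kallisto, just a base R illustration of the idea):

```r
# Toy k-mer counter: tally all k-length substrings of a read sequence
countKmers <- function(seq, k = 3) {
    starts <- 1:(nchar(seq) - k + 1)
    table(vapply(starts, function(i) substr(seq, i, i + k - 1), character(1)))
}
countKmers("ATATGC", k = 3)
## ATA ATG TAT TGC
##   1   1   1   1
```

Real tools match such k-mers against an index built from transcript sequences, assigning reads to transcripts without a full alignment.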

Salmon

Salmon is the latest in a set of k-mer counting tools from the Kingsford lab (Jellyfish and Sailfish) and offers a workflow for transcript quantification from FASTQ raw sequencing data and a FASTA file of transcript sequences.

Installing programs that aren’t in R

There is no R package for Salmon. It is possible to install it on Mac and Linux using the Anaconda package repository (unfortunately there is not a Windows implementation). Anaconda is a huge collection of version controlled packages that can be installed through the conda package management system. With conda it is easy to create and manage environments that have a variety of different packages in them.

Though there is no Salmon R package, we can interact with the anaconda package system using the R package Herper. This was created by us here at the BRC and is part of Bioconductor.

BiocManager::install("Herper")
## Bioconductor version 3.12 (BiocManager 1.30.10), R 4.0.3 (2020-10-10)
## Installing package(s) 'Herper'
## Old packages: 'BH', 'brio', 'cli', 'crayon', 'crosstalk', 'desc', 'diffobj',
##   'DT', 'fansi', 'gert', 'htmltools', 'knitr', 'lifecycle', 'memoise', 'mime',
##   'pillar', 'pkgload', 'promises', 'ps', 'rappdirs', 'Rcpp', 'rlang',
##   'testthat', 'tibble', 'usethis', 'waldo', 'withr', 'xfun', 'boot', 'class',
##   'cluster', 'MASS', 'Matrix', 'mgcv', 'nlme', 'nnet', 'spatial'
library(Herper)

Install Salmon with Herper

First, we will use Herper to install Salmon with the install_CondaTools function. We just need to tell install_CondaTools what tool(s) we want and the name of the environment we want to build.

salmon_paths <- install_CondaTools(tools = "salmon", env = "RNAseq_analysis")
salmon_paths
## $pathToConda
## [1] "/tmp/Rtmp6Az9Qb/rr/Test/bin/conda"
## 
## $environment
## [1] "RNAseq_analysis"
## 
## $pathToEnvBin
## [1] "/tmp/Rtmp6Az9Qb/rr/Test/envs/RNAseq_analysis/bin"

Behind the scenes, Herper will install the most minimal version of conda (called miniconda), and then will create a new environment into which Salmon will be installed. When you run the function it prints out where Salmon is installed. There are additional arguments to control where miniconda is installed (pathToMiniConda) and to update an existing environment (updateEnv).
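For example, a call using these additional arguments might look like the sketch below (the miniconda path is an arbitrary example, not one used in this session):

```r
# Hypothetical sketch: control where miniconda is installed with
# pathToMiniConda, and add tools to an existing environment with updateEnv
salmon_paths <- install_CondaTools(tools = "salmon",
                                   env = "RNAseq_analysis",
                                   pathToMiniConda = "~/my_miniconda",
                                   updateEnv = TRUE)
```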

Reference for transcript quantification

Once Salmon is installed we need to make a Salmon index. First we need a FASTA file of transcript sequences. We can use the GenomicFeatures package's extractTranscriptSeqs function to retrieve a DNAStringSet object of transcript sequences, using our sequence data from BSgenome.Mmusculus.UCSC.mm10 and gene models from the TxDb.Mmusculus.UCSC.mm10.knownGene object.

allTxSeq <- extractTranscriptSeqs(BSgenome.Mmusculus.UCSC.mm10, TxDb.Mmusculus.UCSC.mm10.knownGene, 
    use.names = TRUE)
allTxSeq
## DNAStringSet object of length 142446:
##           width seq                                         names               
##      [1]   1070 AAGGAAAGAGGATAACACTT...GGTATAAAGTTAGAGAAATG ENSMUST00000193812.1
##      [2]    110 GTGCTTGCTTCGGCAACACA...ATATTGAGTAACAAAAATAT ENSMUST00000082908.1
##      [3]    480 TTCTTCTGTGTCTGAGCTAT...GGGATAAATTTACCTTCAAT ENSMUST00000192857.1
##      [4]    250 TGAAAATGGATAGCAGTTCC...TGATCAAATGATTACTCCCA ENSMUST00000161581.1
##      [5]    926 ATGGCATCGTTGGCAGACCG...AAGACAGTAAGAGAGAGTAA ENSMUST00000192183.1
##      ...    ... ...
## [142442]     99 TGTGTTGTCACAGTGCTCCA...TCCACTGTGCTGTCACAGTG ENSMUST00000184505.1
## [142443]    101 GAAATGCCTCCATGAGATCC...TCTCTGGGCTGGTAGTCTTG ENSMUST00000178705.1
## [142444]    100 GAAATGCCTCCATGAGATCC...ATCTCTGGGCTGGTAGTCTT ENSMUST00000180206.1
## [142445]   2373 AGCTGTCCCGGGGACCACAG...AAGTACTTCGAGAGAATGGC ENSMUST00000179505.7
## [142446]   4060 CTCTCTGCTGCCGGAGCAAG...TTGCAAGACATTTTGAAATT ENSMUST00000178343.1

Reference for transcript quantification

With our new DNAStringSet object of transcript sequences we can write a FASTA file for use in Salmon with the writeXStringSet function.

writeXStringSet(allTxSeq, "mm10Trans.fa")

Reference with decoy sequences

We can also provide a set of decoy sequences when building an index. This allows Salmon to consider similar sequences outside of transcriptomic regions when mapping.

To do this we need to add the main chromosomes to our transcriptome FASTA for creating an index.
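A minimal sketch of this step, assuming the allTxSeq and mainChrSeqSet objects created earlier in this session (the combined object is the gentrome used in the next step):

```r
# Concatenate transcript sequences and the major chromosome sequences;
# the chromosome entries will act as decoys for the Salmon index
gentrome <- c(allTxSeq, mainChrSeqSet)
```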

Reference with decoy sequences

We will also need to provide a config file containing the names of sequences to be used for decoys.

writeXStringSet(gentrome, "mm10Gentrome.fa")
write.table(as.data.frame(mainChromosomes), "decoy.txt", row.names = FALSE, col.names = FALSE, 
    quote = FALSE)

Salmon index from R

As with standard aligners, the first part of our quantification requires us to make an index from our FASTA file using the Salmon index command. We specify the transcript FASTA file (-t) and index name (-i).

Here we arrange our Salmon command into a system call from R to allow us to loop through files.

salmon index -i mm10Trans -t mm10Trans.fa

Herper allows us to run conda packages from within R. Salmon has been installed into RNAseq_analysis. So we can use this environment from R using with_CondaEnv().

fastaTx <- "mm10Trans.fa"
indexName <- "mm10Trans"

with_CondaEnv("RNAseq_analysis",
              system2(command = "salmon",
                      args = c("index",
                               "-i", indexName,
                               "-t", fastaTx),
                      stdout = TRUE))
##   [1] ""                                                                                 
##   [2] "Quant"                                                                            
##   [3] "=========="                                                                       
##   [4] "Perform dual-phase, alignment-based estimation of"                                
##   [5] "transcript abundance from RNA-seq reads"                                          
##   [6] ""                                                                                 
##   [7] "salmon quant options:"                                                            
##   [8] ""                                                                                 
##   [9] ""                                                                                 
##  [10] "alignment input options:"                                                         
##  [11] "  --discardOrphans                      [alignment-based mode only] : Discard "   
##  [12] "                                        orphan alignments in the input .  If "    
##  [13] "                                        this flag is passed, then only paired "   
##  [14] "                                        alignments will be considered toward "    
##  [15] "                                        quantification estimates.  The default "  
##  [16] "                                        behavior is to consider orphan "          
##  [17] "                                        alignments if no valid paired mappings "  
##  [18] "                                        exist."                                   
##  [19] "  -l [ --libType ] arg                  Format string describing the library "    
##  [20] "                                        type"                                     
##  [21] "  -a [ --alignments ] arg               input alignment (BAM) file(s)."           
##  [22] "  -e [ --eqclasses ] arg                input salmon weighted equivalence class"  
##  [23] "                                        file."                                    
##  [24] "  -t [ --targets ] arg                  FASTA format file containing target "     
##  [25] "                                        transcripts."                             
##  [26] ""                                                                                 
##  [27] ""                                                                                 
##  [28] "basic options:"                                                                   
##  [29] "  -v [ --version ]                      print version string"                     
##  [30] "  -h [ --help ]                         produce help message"                     
##  [31] "  -o [ --output ] arg                   Output quantification directory."         
##  [32] "  --seqBias                             Perform sequence-specific bias "          
##  [33] "                                        correction."                              
##  [34] "  --gcBias                              [beta for single-end reads] Perform "     
##  [35] "                                        fragment GC bias correction."             
##  [36] "  --posBias                             Perform positional bias correction."      
##  [37] "  -p [ --threads ] arg (=8)             The number of threads to use "            
##  [38] "                                        concurrently."                            
##  [39] "  --incompatPrior arg (=0)              This option sets the prior probability "  
##  [40] "                                        that an alignment that disagrees with "   
##  [41] "                                        the specified library type (--libType) "  
##  [42] "                                        results from the true fragment origin. "  
##  [43] "                                        Setting this to 0 specifies that "        
##  [44] "                                        alignments that disagree with the "       
##  [45] "                                        library type should be \"impossible\", "  
##  [46] "                                        while setting it to 1 says that "         
##  [47] "                                        alignments that disagree with the "       
##  [48] "                                        library type are no less likely than "    
##  [49] "                                        those that do"                            
##  [50] "  -g [ --geneMap ] arg                  File containing a mapping of "            
##  [51] "                                        transcripts to genes.  If this file is "  
##  [52] "                                        provided salmon will output both "        
##  [53] "                                        quant.sf and quant.genes.sf files, "      
##  [54] "                                        where the latter contains aggregated "    
##  [55] "                                        gene-level abundance estimates.  The "    
##  [56] "                                        transcript to gene mapping should be "    
##  [57] "                                        provided as either a GTF file, or a in "  
##  [58] "                                        a simple tab-delimited format where "     
##  [59] "                                        each line contains the name of a "        
##  [60] "                                        transcript and the gene to which it "     
##  [61] "                                        belongs separated by a tab.  The "        
##  [62] "                                        extension of the file is used to "        
##  [63] "                                        determine how the file should be "        
##  [64] "                                        parsed.  Files ending in '.gtf', '.gff'"  
##  [65] "                                        or '.gff3' are assumed to be in GTF "     
##  [66] "                                        format; files with any other extension "  
##  [67] "                                        are assumed to be in the simple format."  
##  [68] "                                        In GTF / GFF format, the "                
##  [69] "                                        \"transcript_id\" is assumed to contain " 
##  [70] "                                        the transcript identifier and the "       
##  [71] "                                        \"gene_id\" is assumed to contain the "   
##  [72] "                                        corresponding gene identifier."           
##  [73] "  --auxTargetFile arg                   A file containing a list of \"auxiliary\""
##  [74] "                                        targets.  These are valid targets "       
##  [75] "                                        (i.e., not decoys) to which fragments "   
##  [76] "                                        are allowed to map and be assigned, and"  
##  [77] "                                        which will be quantified, but for which"  
##  [78] "                                        auxiliary models like sequence-specific"  
##  [79] "                                        and fragment-GC bias correction should "  
##  [80] "                                        not be applied."                          
##  [81] "  --meta                                If you're using Salmon on a metagenomic"  
##  [82] "                                        dataset, consider setting this flag to "  
##  [83] "                                        disable parts of the abundance "          
##  [84] "                                        estimation model that make less sense "   
##  [85] "                                        for metagenomic data."                    
##  [86] ""                                                                                 
##  [87] ""                                                                                 
##  [88] "alignment-specific options:"                                                      
##  [89] "  --noErrorModel                        Turn off the alignment error model, "     
##  [90] "                                        which takes into account the the "        
##  [91] "                                        observed frequency of different types "   
##  [92] "                                        of mismatches / indels when computing "   
##  [93] "                                        the likelihood of a given alignment. "    
##  [94] "                                        Turning this off can speed up "           
##  [95] "                                        alignment-based salmon, but can harm "    
##  [96] "                                        quantification accuracy."                 
##  [97] "  --numErrorBins arg (=6)               The number of bins into which to divide"  
##  [98] "                                        each read when learning and applying "    
##  [99] "                                        the error model.  For example, a value "  
## [100] "                                        of 10 would mean that effectively, a "    
## [101] "                                        separate error model is leared and "      
## [102] "                                        applied to each 10th of the read, while"  
## [103] "                                        a value of 3 would mean that a separate"  
## [104] "                                        error model is applied to the read "      
## [105] "                                        beginning (first third), middle (second"  
## [106] "                                        third) and end (final third)."            
## [107] "  -s [ --sampleOut ]                    Write a \"postSample.bam\" file in the "  
## [108] "                                        output directory that will sample the "   
## [109] "                                        input alignments according to the "       
## [110] "                                        estimated transcript abundances. If "     
## [111] "                                        you're going to perform downstream "      
## [112] "                                        analysis of the alignments with tools "   
## [113] "                                        which don't, themselves, take fragment "  
## [114] "                                        assignment ambiguity into account, you "  
## [115] "                                        should use this output."                  
## [116] "  -u [ --sampleUnaligned ]              In addition to sampling the aligned "     
## [117] "                                        reads, also write the un-aligned reads "  
## [118] "                                        to \"postSample.bam\"."                   
## [119] "  --gencode                             This flag will expect the input "         
## [120] "                                        transcript fasta to be in GENCODE "       
## [121] "                                        format, and will split the transcript "   
## [122] "                                        name at the first '|' character.  These"  
## [123] "                                        reduced names will be used in the "       
## [124] "                                        output and when looking for these "       
## [125] "                                        transcripts in a gene to transcript "     
## [126] "                                        GTF."                                     
## [127] "  --scoreExp arg (=1)                   The factor by which sub-optimal "         
## [128] "                                        alignment scores are downweighted to "    
## [129] "                                        produce a probability.  If the best "     
## [130] "                                        alignment score for the current read is"  
## [131] "                                        S, and the score for a particular "       
## [132] "                                        alignment is w, then the probability "    
## [133] "                                        will be computed porportional to exp( -"  
## [134] "                                        scoreExp * (S-w) ). NOTE: This flag "     
## [135] "                                        only has an effect if you are parsing "   
## [136] "                                        alignments produced by salmon itself "    
## [137] "                                        (i.e. pufferfish or RapMap in "           
## [138] "                                        selective-alignment mode)."               
## [139] "  --mappingCacheMemoryLimit arg (=2000000)"                                       
## [140] "                                        If the file contained fewer than this "   
## [141] "                                        many mapped reads, then just keep the "   
## [142] "                                        data in memory for subsequent rounds of"  
## [143] "                                        inference. Obviously, this value should"  
## [144] "                                        not be too large if you wish to keep a "  
## [145] "                                        low memory usage, but setting it large "  
## [146] "                                        enough to accommodate all of the mapped"  
## [147] "                                        read can substantially speed up "         
## [148] "                                        inference on \"small\" files that contain"
## [149] "                                        only a few million reads."                
## [150] ""                                                                                 
## [151] ""                                                                                 
## [152] "advanced options:"                                                                
## [153] "  --alternativeInitMode                 [Experimental]: Use an alternative "      
## [154] "                                        strategy (rather than simple "            
## [155] "                                        interpolation between) the online and "   
## [156] "                                        uniform abundance estimates to "          
## [157] "                                        initialize the EM / VBEM algorithm."      
## [158] "  --auxDir arg (=aux_info)              The sub-directory of the quantification"  
## [159] "                                        directory where auxiliary information "   
## [160] "                                        e.g. bootstraps, bias parameters, etc. "  
## [161] "                                        will be written."                         
## [162] "  --skipQuant                           Skip performing the actual transcript "   
## [163] "                                        quantification (including any Gibbs "     
## [164] "                                        sampling or bootstrapping)."              
## [165] "  --dumpEq                              Dump the simple equivalence class "       
## [166] "                                        counts that were computed during "        
## [167] "                                        mapping or alignment."                    
## [168] "  -d [ --dumpEqWeights ]                Dump conditional probabilities "          
## [169] "                                        associated with transcripts when "        
## [170] "                                        equivalence class information is being "  
## [171] "                                        dumped to file. Note, this will dump "    
## [172] "                                        the factorization that is actually used"  
## [173] "                                        by salmon's offline phase for "           
## [174] "                                        inference.  If you are using "            
## [175] "                                        range-factorized equivalence classes "    
## [176] "                                        (the default) then the same transcript "  
## [177] "                                        set may appear multiple times with "      
## [178] "                                        different associated conditional "        
## [179] "                                        probabilities."                           
## [180] "  --minAssignedFrags arg (=10)          The minimum number of fragments that "    
## [181] "                                        must be assigned to the transcriptome "   
## [182] "                                        for quantification to proceed."           
## [183] "  --reduceGCMemory                      If this option is selected, a more "      
## [184] "                                        memory efficient (but slightly slower) "  
## [185] "                                        representation is used to compute "       
## [186] "                                        fragment GC content. Enabling this will"  
## [187] "                                        reduce memory usage, but can also "       
## [188] "                                        reduce speed.  However, the results "     
## [189] "                                        themselves will remain the same."         
## [190] "  --biasSpeedSamp arg (=5)              The value at which the fragment length "  
## [191] "                                        PMF is down-sampled when evaluating "     
## [192] "                                        sequence-specific & GC fragment bias.  "  
## [193] "                                        Larger values speed up effective length"  
## [194] "                                        correction, but may decrease the "        
## [195] "                                        fidelity of bias modeling results."       
## [196] "  --fldMax arg (=1000)                  The maximum fragment length to consider"  
## [197] "                                        when building the empirical "             
## [198] "                                        distribution"                             
## [199] "  --fldMean arg (=250)                  The mean used in the fragment length "    
## [200] "                                        distribution prior"                       
## [201] "  --fldSD arg (=25)                     The standard deviation used in the "      
## [202] "                                        fragment length distribution prior"       
## [203] "  -f [ --forgettingFactor ] arg (=0.65000000000000002)"                           
## [204] "                                        The forgetting factor used in the "       
## [205] "                                        online learning schedule.  A smaller "    
## [206] "                                        value results in quicker learning, but "  
## [207] "                                        higher variance and may be unstable.  A"  
## [208] "                                        larger value results in slower learning"  
## [209] "                                        but may be more stable.  Value should "   
## [210] "                                        be in the interval (0.5, 1.0]."           
## [211] "  --initUniform                         initialize the offline inference with "   
## [212] "                                        uniform parameters, rather than seeding"  
## [213] "                                        with online parameters."                  
## [214] "  --maxOccsPerHit arg (=1000)           When collecting \"hits\" (MEMs), hits "   
## [215] "                                        having more than maxOccsPerHit "          
## [216] "                                        occurrences won't be considered."         
## [217] "  -w [ --maxReadOcc ] arg (=200)        Reads \"mapping\" to more than this many "
## [218] "                                        places won't be considered."              
## [219] "  --noLengthCorrection                  [experimental] : Entirely disables "      
## [220] "                                        length correction when estimating the "   
## [221] "                                        abundance of transcripts.  This option "  
## [222] "                                        can be used with protocols where one "    
## [223] "                                        expects that fragments derive from "      
## [224] "                                        their underlying targets without regard"  
## [225] "                                        to that target's length (e.g. QuantSeq)"  
## [226] "  --noEffectiveLengthCorrection         Disables effective length correction "    
## [227] "                                        when computing the probability that a "   
## [228] "                                        fragment was generated from a "           
## [229] "                                        transcript.  If this flag is passed in,"  
## [230] "                                        the fragment length distribution is not"  
## [231] "                                        taken into account when computing this "  
## [232] "                                        probability."                             
## [233] "  --noSingleFragProb                    Disables the estimation of an "           
## [234] "                                        associated fragment length probability "  
## [235] "                                        for single-end reads or for orphaned "    
## [236] "                                        mappings in paired-end libraries.  The "  
## [237] "                                        default behavior is to consider the "     
## [238] "                                        probability of all possible fragment "    
## [239] "                                        lengths associated with the retained "    
## [240] "                                        mapping.  Enabling this flag (i.e. "      
## [241] "                                        turning this default behavior off) will"  
## [242] "                                        simply not attempt to estimate a "        
## [243] "                                        fragment length probability in such "     
## [244] "                                        cases."                                   
## [245] "  --noFragLengthDist                    [experimental] : Don't consider "         
## [246] "                                        concordance with the learned fragment "   
## [247] "                                        length distribution when trying to "      
## [248] "                                        determine the probability that a "        
## [249] "                                        fragment has originated from a "          
## [250] "                                        specified location.  Normally, "          
## [251] "                                        Fragments with unlikely lengths will be"  
## [252] "                                        assigned a smaller relative probability"  
## [253] "                                        than those with more likely lengths.  "   
## [254] "                                        When this flag is passed in, the "        
## [255] "                                        observed fragment length has no effect "  
## [256] "                                        on that fragment's a priori "             
## [257] "                                        probability."                             
## [258] "  --noBiasLengthThreshold               [experimental] : If this option is "      
## [259] "                                        enabled, then no (lower) threshold will"  
## [260] "                                        be set on how short bias correction can"  
## [261] "                                        make effective lengths. This can "        
## [262] "                                        increase the precision of bias "          
## [263] "                                        correction, but harm robustness.  The "   
## [264] "                                        default correction applies a threshold."  
## [265] "  --numBiasSamples arg (=2000000)       Number of fragment mappings to use when"  
## [266] "                                        learning the sequence-specific bias "     
## [267] "                                        model."                                   
## [268] "  --numAuxModelSamples arg (=5000000)   The first <numAuxModelSamples> are used"  
## [269] "                                        to train the auxiliary model parameters"  
## [270] "                                        (e.g. fragment length distribution, "     
## [271] "                                        bias, etc.).  After ther first "          
## [272] "                                        <numAuxModelSamples> observations the "   
## [273] "                                        auxiliary model parameters will be "      
## [274] "                                        assumed to have converged and will be "   
## [275] "                                        fixed."                                   
## [276] "  --numPreAuxModelSamples arg (=5000)   The first <numPreAuxModelSamples> will "  
## [277] "                                        have their assignment likelihoods and "   
## [278] "                                        contributions to the transcript "         
## [279] "                                        abundances computed without applying "    
## [280] "                                        any auxiliary models.  The purpose of "   
## [281] "                                        ignoring the auxiliary models for the "   
## [282] "                                        first <numPreAuxModelSamples> "           
## [283] "                                        observations is to avoid applying these"  
## [284] "                                        models before their parameters have "     
## [285] "                                        been learned sufficiently well."          
## [286] "  --useEM                               Use the traditional EM algorithm for "    
## [287] "                                        optimization in the batch passes."        
## [288] "  --useVBOpt                            Use the Variational Bayesian EM "         
## [289] "                                        [default]"                                
## [290] "  --rangeFactorizationBins arg (=4)     Factorizes the likelihood used in "       
## [291] "                                        quantification by adopting a new notion"  
## [292] "                                        of equivalence classes based on the "     
## [293] "                                        conditional probabilities with which "    
## [294] "                                        fragments are generated from different "  
## [295] "                                        transcripts.  This is a more "            
## [296] "                                        fine-grained factorization than the "     
## [297] "                                        normal rich equivalence classes.  The "   
## [298] "                                        default value (4) corresponds to the "    
## [299] "                                        default used in Zakeri et al. 2017 "      
## [300] "                                        (doi: 10.1093/bioinformatics/btx262), "   
## [301] "                                        and larger values imply a more "          
## [302] "                                        fine-grained factorization.  If range "   
## [303] "                                        factorization is enabled, a common "      
## [304] "                                        value to select for this parameter is "   
## [305] "                                        4. A value of 0 signifies the use of "    
## [306] "                                        basic rich equivalence classes."          
## [307] "  --numGibbsSamples arg (=0)            Number of Gibbs sampling rounds to "      
## [308] "                                        perform."                                 
## [309] "  --noGammaDraw                         This switch will disable drawing "        
## [310] "                                        transcript fractions from a Gamma "       
## [311] "                                        distribution during Gibbs sampling.  In"  
## [312] "                                        this case the sampler does not account "  
## [313] "                                        for shot-noise, but only assignment "     
## [314] "                                        ambiguity"                                
## [315] "  --numBootstraps arg (=0)              Number of bootstrap samples to "          
## [316] "                                        generate. Note: This is mutually "        
## [317] "                                        exclusive with Gibbs sampling."           
## [318] "  --bootstrapReproject                  This switch will learn the parameter "    
## [319] "                                        distribution from the bootstrapped "      
## [320] "                                        counts for each sample, but will "        
## [321] "                                        reproject those parameters onto the "     
## [322] "                                        original equivalence class counts."       
## [323] "  --thinningFactor arg (=16)            Number of steps to discard for every "    
## [324] "                                        sample kept from the Gibbs chain. The "   
## [325] "                                        larger this number, the less chance "     
## [326] "                                        that subsequent samples are "             
## [327] "                                        auto-correlated, but the slower "         
## [328] "                                        sampling becomes."                        
## [329] "  -q [ --quiet ]                        Be quiet while doing quantification "     
## [330] "                                        (don't write informative output to the "  
## [331] "                                        console unless something goes wrong)."    
## [332] "  --perTranscriptPrior                  The prior (either the default or the "    
## [333] "                                        argument provided via --vbPrior) will "   
## [334] "                                        be interpreted as a transcript-level "    
## [335] "                                        prior (i.e. each transcript will be "     
## [336] "                                        given a prior read count of this value)"  
## [337] "  --perNucleotidePrior                  The prior (either the default or the "    
## [338] "                                        argument provided via --vbPrior) will "   
## [339] "                                        be interpreted as a nucleotide-level "    
## [340] "                                        prior (i.e. each nucleotide will be "     
## [341] "                                        given a prior read count of this value)"  
## [342] "  --sigDigits arg (=3)                  The number of significant digits to "     
## [343] "                                        write when outputting the "               
## [344] "                                        EffectiveLength and NumReads columns"     
## [345] "  --vbPrior arg (=0.01)                 The prior that will be used in the VBEM"  
## [346] "                                        algorithm.  This is interpreted as a "    
## [347] "                                        per-transcript prior, unless the "        
## [348] "                                        --perNucleotidePrior flag is also "       
## [349] "                                        given.  If the --perNucleotidePrior "     
## [350] "                                        flag is given, this is used as a "        
## [351] "                                        nucleotide-level prior.  If the default"  
## [352] "                                        is used, it will be divided by 1000 "     
## [353] "                                        before being used as a nucleotide-level"  
## [354] "                                        prior, i.e. the default per-nucleotide "  
## [355] "                                        prior will be 1e-5."                      
## [356] "  --writeOrphanLinks                    Write the transcripts that are linked "   
## [357] "                                        by orphaned reads."                       
## [358] "  --writeUnmappedNames                  Write the names of un-mapped reads to "   
## [359] "                                        the file unmapped_names.txt in the "      
## [360] "                                        auxiliary directory."                     
## [361] ""

Salmon index from R

To create an index with decoys, we provide our combined transcriptome + genome FASTA to the -t parameter and our file of decoy sequence names to the -d parameter.
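Before indexing, the decoy file and the combined "gentrome" FASTA have to be built. A minimal sketch (file names `genome.fa` and `transcriptome.fa` are placeholders for your own files): the decoy file is simply one genome sequence name per line, taken from the genome FASTA headers, and the gentrome is the transcriptome with the genome appended after it.

```shell
# Extract sequence names from the genome FASTA headers (">chr1 desc" -> "chr1")
grep "^>" genome.fa | sed 's/^>//; s/ .*//' > decoys.txt

# Concatenate transcriptome then genome; decoy sequences must come last
cat transcriptome.fa genome.fa > gentrome.fa
```

The resulting `gentrome.fa` and `decoys.txt` are what get passed to salmon index via -t and -d respectively.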

with_CondaEnv("RNAseq_analysis",
              system2(command = "salmon",
                      args = c("index",
                               "-i", indexName,
                               "-t", fastaTx,
                               "-d", "decoy.txt"),
                      stdout = TRUE))

Salmon quant from R

To quantify transcript abundance we use the Salmon quant command.

We specify the index location (-i), the reads to quantify (-r), the output directory (-o), and set the library type to automatic detection (-l A).

salmon quant -i mm10Trans -r ~/Downloads/ENCFF332KDA.fastq.gz -o TReg_1_Quant -l A

fq <- "ENCFF332KDA.fastq.gz"
outDir <- "TReg_1_Quant"

with_CondaEnv("RNAseq_analysis",
              system2(command = "salmon",
                      args = c("quant",
                               "-i", indexName,
                               "-o", outDir,
                               "-l", "A",
                               "-r", fq),
                      stdout = TRUE))

Salmon outputs

The output from Salmon is our quantification of transcripts in the quant.sf file within the user-defined output directory.

We will use the tximport package to handle these counts in the next session, but for now we can review the file with standard R functions.

myQuant <- read.delim("TReg_1_Quant/quant.sf")
myQuant[1:3, ]
##                   Name Length EffectiveLength    TPM NumReads
## 1 ENSMUST00000193812.1   1070         820.000 0.0000        0
## 2 ENSMUST00000082908.1    110           3.749 0.0000        0
## 3 ENSMUST00000192857.1    480         230.000 0.8368        7
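Since TPM is a per-million normalisation, a quick sanity check on a quant.sf (assuming the tab-delimited layout shown above, with TPM in the fourth column) is that the TPM column should sum to roughly one million:

```shell
# Sum the TPM column (column 4), skipping the header line;
# the result should be ~1e6 for a complete quant.sf
awk -F'\t' 'NR > 1 { sum += $4 } END { print sum }' TReg_1_Quant/quant.sf
```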

Time for an exercise

Link_to_exercises

Link_to_answers